ICPE 689 Data Science Fundamental for Energy II¶

Some of the examples and exercises of this course are based on two popular books on data science with Python, Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow and the Python Data Science Handbook.

Prediction Methods¶

Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events.

Technically, many machine learning methods are predictive to some extent, but regression models are the mainstay for predictive analytics. Furthermore, if time is one of the independent variables, time series analysis could be used for predictive analystics as well.

With that in mind, this module consists of two parts:

  1. Regression Analysis

  2. Advanded Time Series Analysis

In part 2, after we briefly review the basics of time series analysis, we will demonstrate the usage of Facebook Prophet through a real world example.

Linear Regression Models¶

The linear model implies that the target is specified as a linear combination of features. This section will cover the content listed below:

  • 1 Simple Linear Regression
  • 2 Multiple Linear Regression
  • 3 Polynomial Regression
  • 4 Regularization
  • 5 Linear Classifier and Logistic Regression
  • 6 Hands-on Exercise

Simple Linear Regression¶

Simple linear regression lives up to its name: it's a very straightforward approach for predicting a quantitative $Y$ on the basis of a single predcitor variable $X$, i.e., a straight-line fit to data. . It assumes that there is an approximately a linear relationship between $X$ and $Y$. Mathematically, we can write this linear relationship as $$ Y = \beta_0 + \beta_1 X + \epsilon $$

  • $\epsilon$ represents the randomness term.
  • Sometimes, the equation above is describes by saying that we are regressing $Y$ on $X$ (or $Y$ onto X).
  • $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope terms in the linear model. Together, $\beta_0$ and $\beta_1$ are known as the model coefficients or parameters.
  • Once we have used our dataset to produce estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ for the model coefficients, we can predict the $Y$ on the basis of $X = x$ which is computed as $$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x. $$

The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. Using some calculus, we can show that the minimizers are $$ \hat{\beta}_1 = \frac{\sum_{i = 1}^n(x_i-\bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}\\ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}, $$ where $\bar{y} = \frac{1}{n}\sum_{i = 1}^ny_i$ and $\bar{x} = \frac{1}{n}\sum_{i = 1}^nx_i$ are sample means.

Estimate the Coefficients¶

In practice, $\beta_0$ and $\beta_1$ are unknown. So before we can use the model to make predictions, we must use data to estimate the coefficients. Let $$ (x_1,y_1), (x_2,y_2), \cdots, (x_n,y_) $$ represent $n$ observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$. The most common approach to estimate the coefficients involves minimizing the least squares criterion. The details are presented below,

Let $\hat{y}_i = \beta_0 + \beta_1x_i$ be the prediction for $Y$ based on the ith value of $X$. The the $e_i = y_i - \hat{y_i}$ represents the ith residual - this is the difference bewteen the $ith$ observed response value and the ith response value that is predicted by the model. The residual sum of squares (RSS) is defined as $$ RSS = e_1^2+e_2^2+\cdots+e_n^2,. $$

Matrix Representation¶

In the simple linear regression, we aim to predict the response for the ith individual, $y_i$ using the individual's value of a single predictor variable, $x_i$. The form of the model is given by: $$ y_i = \beta_0 + \beta_1x_i + \epsilon_i $$ which comprises a deterministic component involving the regression coefficients $\boldsymbol{\beta} = (\beta_0,\beta_1)$ and a random component involving the error term $\epsilon_i$.

Let $X$ be the design matrix and $Y$ be the response vector such that $$X=\left[ \begin{matrix} 1 & x_1\\ 1 & x_2 \\ \ldots \\ 1 & x_n\\ \end{matrix} \right],\ Y=\left[ \begin{matrix} y_1\\ y_2 \\ \ldots \\ y_n\\ \end{matrix} \right]$$

then $\boldsymbol{\beta}$ can be estimated by $$\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^TY$$

Assess the Accuracy of the Model¶

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.

  • Residual Standard Error : Due to the presence of these error terms $\epsilon$, even if we knew the true regression line (i.e. even if $\beta_0$ and $\beta_1$ were known), we would not be able to perfectly predict $Y$ from $X$. The RSE is an estimate of the standard deviation of $\epsilon$. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula $$ RSE = \sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-1}\sum_{i = 1}^n(y_i-\hat{y}_i)^2} $$ The RSE provides an absolute measure of lack of fit of the model to the data. But since it is measured in the units of Y , it is not always clear what constitutes a good RSE
  • $R^2$ Statistics: The $R^2$ Statistics takes the form of a proportion—the proportion of variance explained—and so it always takes on a value between 0 and 1, and is independent of the scale of Y. The formula to calculate $R^2$ is: $$ R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}, $$ where $TSS = \sum(y_i - \bar{y})^2$ is the total sum of sqaures. TSS measures the total variance in the response Y , and can be thought of as the amount of variability inherent in the response before the regression is performed. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
In [126]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline
plt.style.use('seaborn-white')
import warnings
warnings.filterwarnings('ignore')

Example¶

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In [3]:
url = 'https://raw.githubusercontent.com/jtao/AdvancedML/main/data/Advertising.csv'
advertising = pd.read_csv(url,usecols=[1,2,3,4])
advertising.head(5)
Out[3]:
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9
In [4]:
# Visualization
X0 = advertising.TV
y = advertising.Sales
plt.scatter(X0,y)
plt.xlabel("TV")
plt.ylabel("sales")
Out[4]:
Text(0, 0.5, 'sales')

We can use Scikit-Learn's LinearRegression estimator to fit this data and construct the best-fit line:

In [5]:
# Simple linear regression
## Fit a simple linear regression
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = X0.values.reshape(-1,1)
regr.fit(X,y)
Out[5]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [6]:
## Make prediction of Y based on a sequence of X value (xfit)
xfit = np.linspace(0,300,1000)
yfit = regr.predict(xfit[:, np.newaxis])
plt.scatter(X, y)
plt.plot(xfit, yfit,color = 'r',linewidth = 3);

The slope and intercept of the data are contained in the regr's fit parameters, which in Scikit-Learn are always marked by a trailing underscore. Here the relevant parameters are coef_ and intercept_:

In [7]:
print("Model slope:    ", regr.coef_)
print("Model intercept:", regr.intercept_)
Model slope:     [0.04753664]
Model intercept: 7.032593549127695

The $R^2$ statistics is calculated below which gives $R^2 = 0.61$, and so just under two-thirds of the variability in sales is explained by a linear regression on TV.

In [8]:
Sales_pred = regr.predict(X)
r2_score(y, Sales_pred)
Out[8]:
0.611875050850071

The code below computes RSS for a number of values of $\beta_0$ and $\beta_1$ using the advertising data with sales as response and TV as the predictor. In each plot, the red dot represents the pair of least squares estimates $(\hat{\beta}_0,\hat{\beta}_1)$. These values clearly minimize the RSS.

In [9]:
# Create grid coordinates for plotting
B0 = np.linspace(regr.intercept_-2, regr.intercept_+2, 50) # beta_0 sequence
B1 = np.linspace(regr.coef_-0.02, regr.coef_+0.02, 50) # beta_1 sequence
xx, yy = np.meshgrid(B0, B1, indexing='xy') # create a mesh spanned by beta_0 and beta_1
Z = np.zeros((B0.size,B1.size)) # create a zero matrix to store the RSS value

# Calculate Z-values (RSS) based on grid of coefficients
for (i,j),v in np.ndenumerate(Z):
    Z[i,j] =((y - (xx[i,j]+X.ravel()*yy[i,j]))**2).sum()/1000

# Minimized RSS
min_RSS = r'$\beta_0$, $\beta_1$ for minimized RSS'
min_rss = np.sum((regr.intercept_+regr.coef_*X - y.values.reshape(-1,1))**2)/1000
print("RSS corresponding the the least squares estimate (min_rss):    ", min_rss)
RSS corresponding the the least squares estimate (min_rss):     2.1025305831313514
In [10]:
fig = plt.figure(figsize=(15,6))
fig.suptitle('RSS - Regression coefficients', fontsize=20)

ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122, projection='3d')

# Left plot
CS = ax1.contour(xx, yy, Z, cmap=plt.cm.Set1, levels=[2.15, 2.2, 2.3, 2.5, 3])
ax1.scatter(regr.intercept_, regr.coef_[0], c='r', label=min_RSS)
ax1.clabel(CS, inline=True, fontsize=10, fmt='%1.1f')

# Right plot
ax2.plot_surface(xx, yy, Z, rstride=3, cstride=3, alpha=0.3)
ax2.contour(xx, yy, Z, zdir='z', offset=Z.min(), cmap=plt.cm.Set1,
            alpha=0.4, levels=[2.15, 2.2, 2.3, 2.5, 3])
ax2.scatter3D(regr.intercept_, regr.coef_[0], min_rss, c='r', label=min_RSS)
ax2.set_zlabel('RSS')
ax2.set_zlim(Z.min(),Z.max())
ax2.set_ylim(0.02,0.07)
Out[10]:
(0.02, 0.07)
In [11]:
# settings common to both plots
for ax in fig.axes:
    ax.set_xlabel(r'$\beta_0$', fontsize=17)
    ax.set_ylabel(r'$\beta_1$', fontsize=17)
    ax.set_yticks([0.03,0.04,0.05,0.06])
    ax.legend()

Multiple Linear Regression¶

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. Suppose that we have $p$ distinct predictors, the multiple linear regression model takes the form $$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots +\beta_pX_p + \epsilon, $$ where $X_j$ represents the $jth$ predictor and $\beta_j$ quantifies the association between that variable and the response.

  • Coefficient Estimation: As was the case in the simple linear regression setting, the regression coefficients $\beta_0, \beta_1,\cdots,\beta_p$ are unknown and can be estimated using the same least squares approach.
  • Model Assessment: Two numerical measures of model fit RSE and $R^2$ are computed and interpreted in the same fashion as for simple linear regression.

Example¶

We consider a linear regression fit to sales using TV and radio as predictors.

In [12]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = advertising[['Radio', 'TV']]
y = advertising.Sales
regr.fit(X,y)

print("Coefficient of Radio :", regr.coef_[0])
print("Coefficient of TV :", regr.coef_[1])
print("Model intercept:", regr.intercept_)
Coefficient of Radio : 0.1879942266203091
Coefficient of TV : 0.04575481510107615
Model intercept: 2.9210999124051398
In [13]:
# Create a coordinate grid
Radio = np.arange(0,50)
TV = np.arange(0,300)

B1, B2 = np.meshgrid(Radio, TV, indexing='xy')
Z = np.zeros((TV.size, Radio.size))

for (i,j),v in np.ndenumerate(Z):
        Z[i,j] =(regr.intercept_ + B1[i,j]*regr.coef_[0] + B2[i,j]*regr.coef_[1])
In [14]:
# Create plot
fig = plt.figure(figsize=(10,6))
fig.suptitle('Regression: Sales ~ Radio + TV Advertising', fontsize=20)
ax = axes3d.Axes3D(fig)
ax.plot_surface(B1, B2, Z, rstride=10, cstride=5, alpha=0.4)
ax.scatter3D(advertising.Radio, advertising.TV, advertising.Sales, c='r')
ax.set_xlabel('Radio')
ax.set_xlim(0,50)
ax.set_ylabel('TV')
ax.set_ylim(ymin=0)
ax.set_zlabel('Sales');

Polynomial Regression¶

Historically, the standard way to extend linear regression to settings in which the relationship between the predictors and the response is non-linear has been to replace the standard linear model $$ y_i = \beta_0 + \beta_1x_i + \epsilon_i $$ with a polynomial function $$ y_i = \beta_0+\beta_1x_i+\beta_2x_i^2+...+\beta_dx_i^d + \epsilon_i $$ where $\epsilon_i$ is the error term. This approach is known as polynomial regression.

Matrix Representation¶

Following the notation we defined in the simple linear regression, the polynomial regression model $$ y_i = \beta_0+\beta_1x_i+\beta_2x_i^2+...+\beta_dx_i^d + \epsilon_i $$ can be expressed in matrix form in terms of a design matrix $X$ and a response vector $Y$ such that $$ Y = X\boldsymbol{\beta} + \boldsymbol{\epsilon} $$ where $$ X = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^d \\ 1 & x_2 & x_2^2 & \dots & x_2^d \\ 1 & x_3 & x_3^2 & \dots & x_3^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^d \end{bmatrix} $$ Then the vector of estimated polynomial regression coefficients is $ \hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^TY$$ $.

Example¶

The dataset Congress contains the congress's approval ratings from 1974 to 2012. Our goal it to develop a model to describe the trend of the approval ratings.

There are two columns in the dataset, we will use Year as predictor $X$ and Rating as response variable $Y$.

In [15]:
congress = pd.read_csv("https://raw.githubusercontent.com/jtao/AdvancedML/main/data/Congress.csv", header = 0)
Rating = congress.Rating.to_numpy().reshape(-1, 1)
Year = congress.Year.to_numpy().reshape(-1, 1)
congress.head()
Out[15]:
Rating Year
0 10 2012.710959
1 16 2012.628767
2 17 2012.541096
3 15 2012.445205
4 17 2012.379452
In [54]:
# dataset reshape
data_test = np.arange(1974, 2013, step = 0.1).reshape(-1, 1)

# Linear Regression Model: Rating ~ Year
fit_lm = linear_model.LinearRegression()
fit1 = fit_lm.fit(Year, Rating)
y1 = fit1.predict(data_test)

# Polynomial Regression Model: Rating ~ Year + Year^2
from sklearn.preprocessing import PolynomialFeatures
x_poly_2 = PolynomialFeatures(degree = 2).fit_transform(Year)
fit2 = fit_lm.fit(x_poly_2, Rating)
y2 = fit2.predict(PolynomialFeatures(degree = 2).fit_transform(data_test))

# Polynomial Regression Model: Rating ~ Year + Year^2 + Year^3
x_poly_3 = PolynomialFeatures(degree = 3).fit_transform(Year)
fit3 = fit_lm.fit(x_poly_3, Rating)
y3 = fit3.predict(PolynomialFeatures(degree = 3).fit_transform(data_test))
In [64]:
# Visualization (the peak is due to the 911 attack.)
# from IPython.display import IFrame
# IFrame(src='https://news.gallup.com/poll/6793/rally-effect-911-terrorist-attacks-virtually-gone.aspx', width=700, height=300)
plt.scatter(Year, Rating, color = "white", edgecolor = "black", label="Raw Data")
plt.plot(data_test, y1, '-', color = "black", label="Linear")
plt.plot(data_test, y2, '-', color = "red", label="Polynomial Degree = 2")
plt.plot(data_test, y3, '-', color = "green", label="Polynomial Degree = 3")
plt.xlabel("Year")
plt.ylabel("Rating")
plt.legend()
Out[64]:
<matplotlib.legend.Legend at 0x7fb4c991de50>

Regularization¶

The introduction of basis functions (e.g. polynomial basis functions) into the linear regression makes the model much more flexible, but it could lead to over-fitting.

The over-fitting problem often happens when the model is very complex and thereby the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data (test data).

We will illustrate this problem by a regression problem using Gaussian basis function. Let's first create a data set $(x,y)$ with $$y = sin(x) + \epsilon$$

In [82]:
# Generate data pair(x,y) such that y = sin(x) + epsilon
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
x_t = np.linspace(0,10)
plt.plot(x_t, np.sin(x_t), label="$y=sin(x)$")
plt.scatter(x, y, c="r", label="$y=sin(x)+\epsilon$")
plt.legend();
In [65]:
# The Gaussian basis functions are not built into Scikit-learn, therefore, we define a custome
# transformer class in this code chunk. 
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class GaussianFeatures(BaseEstimator, TransformerMixin):
    """Uniformly spaced Gaussian features for one-dimensional input"""
    def __init__(self, N, width_factor=2.0):
        self.N = N
        self.width_factor = width_factor
# @classmethod and @staticmethod are two python decorators
    @staticmethod
    def _gauss_basis(x, y, width, axis=None):
        arg = (x - y) / width
        return np.exp(-0.5 * np.sum(arg ** 2, axis))
        
    def fit(self, X, y=None):
        # create N centers spread along the data range
        self.centers_ = np.linspace(X.min(), X.max(), self.N)
        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
        return self
        
    def transform(self, X):
        return self._gauss_basis(X[:, :, np.newaxis], self.centers_,
                                 self.width_, axis=1)
In [116]:
# Use 10 Gaussian bases to construct the regression model. 
gauss_model = make_pipeline(GaussianFeatures(10),
                            linear_model.LinearRegression())
gauss_model.fit(x[:, np.newaxis], y)
yfit = gauss_model.predict(xfit[:, np.newaxis])

plt.scatter(x, y, c="r", label="Raw Data")
plt.plot(xfit, yfit, c = 'black',linewidth =1, label="Best Fit Line")
plt.plot(x_t, np.sin(x_t), label="$y=sin(x)$", alpha=0.4)
plt.xlim(0, 10)
plt.ylim(-1.5,1.5)
plt.legend();
In [121]:
# Use 30 Gaussian bases to construct the regression model. 
gauss_model = make_pipeline(GaussianFeatures(30),
                            linear_model.LinearRegression())
gauss_model.fit(x[:, np.newaxis], y)
yfit = gauss_model.predict(xfit[:, np.newaxis])

# Visualization
plt.scatter(x, y, c="r", label="Raw Data")
plt.plot(xfit, yfit, c = 'black',linewidth =1, label="Best Fit Line")
plt.plot(x_t, np.sin(x_t), label="$y=sin(x)$", alpha=0.4)
plt.xlim(0, 10)
plt.ylim(-4,1.5)
plt.legend();

With 30-dimensional basis, the model has far too much flexibility and goes to extreme values between locations where it is constrained by data. To get rid of this problem, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. Hence shrinkage methods can also perform variable selection. We will introduce two types of shrinkage methods:

  • Ridge Regression
  • Lasso Regression

Ridge Regression ($L_2$ regularization)¶

The most common form of regularization is known as ridge regression or $L_2$ regularization. This proceeds by penalizing the sum of squares of the model coefficients. Recall the least squares fitting procedure estimates $\beta_0,\beta_1,...,\beta_p$ using the values that minimize $$ RSS = \sum_{i = 1}^n\left(y_i - \beta_0-\sum_{j = 1}^p \beta_jx_{ij}\right)^2 $$

The ridge regression coefficient estimates $\beta_0,...,\beta_p$ by minimizing $$ \sum_{i = 1}^n\left(y_i - \beta_0-\sum_{j = 1}^p \beta_jx_{ij}\right)^2 + \lambda\sum_{j = 1}^p\beta_j^2, $$ where $\lambda$ is a tuning parameter that controls the strength of the penalty. This type of penalized model is build into Scikit-Learn with Ridge estimator:

In [123]:
from sklearn.linear_model import Ridge
model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
model.fit(x[:, np.newaxis], y)
yfit = model.predict(xfit[:, np.newaxis])

## Visualization
plt.scatter(x, y, c="r", label="Raw Data")
plt.plot(xfit, yfit, c = 'black',linewidth =1, label="Best Fit Line")
plt.plot(x_t, np.sin(x_t), label="$y=sin(x)$", alpha=0.4)
plt.xlim(0, 10)
plt.ylim(-1.5,1.5)
plt.legend();

Lasso regression ($L_1$ regularization)¶

Another very common type of regularization is Lasso regression and it involves penalizing the sum of absolute values of regression coefficients.

One obvious disadvantage of ridge regression is the the penalty $\lambda\sum\beta_j^2$ will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero. This can create a challenge in the model interpretation in setttings in which the number of variables $p$ is quite large.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients are estimated by minimizing $$ \sum_{i = 1}^n\left(y_i - \beta_0-\sum_{j = 1}^p \beta_jx_{ij}\right)^2 + \lambda\sum_{j = 1}^p|\beta_j|, $$

In [127]:
from sklearn.linear_model import Lasso
model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
model.fit(x[:, np.newaxis], y)
yfit = model.predict(xfit[:, np.newaxis])
## Visualization
plt.scatter(x, y, c="r", label="Raw Data")
plt.plot(xfit, yfit, c = 'black',linewidth =1, label="Best Fit Line")
plt.plot(x_t, np.sin(x_t), label="$y=sin(x)$", alpha=0.4)
plt.xlim(0, 10)
plt.ylim(-1.5,1.5)
plt.legend();

Logistic Regression¶

Linear Classifier¶

The basic idea behind a linear classifier two target classes can be separated by a hyperplane in the feature space. If this can be done without error, the training set is called linearly separable.

We will introduce the logistic regression using the Default dataset provided by ISRL R package. This is simulated data set containing information on ten thousand customers. The goal is to predict which customers will default on their credit card debt based on their balance and income.

Columns:

  • default: A factor with levels No and Yes indicating whether the customer defaulted on their debt
  • balance: The average balance that the customer has remaining on their credit card after making their monthly payment
  • income: Income of customers
In [137]:
df = pd.read_csv('https://raw.githubusercontent.com/jtao/AdvancedML/main/data/Default.csv')
# Note: factorize() returns two objects: a label array and an array with the unique values.
# We are only interested in the first object. 
df['default2'] = df.default.factorize()[0]
#df[df["default"]=="Yes"].head()
df.head()
Out[137]:
default balance income default2
0 No 729.526495 44361.625074 0
1 No 817.180407 12106.134700 0
2 No 1073.549164 31767.138947 0
3 No 529.250605 35704.493935 0
4 No 785.655883 38463.495879 0
In [140]:
fig = plt.figure(figsize=(12,5))
gs = mpl.gridspec.GridSpec(1, 4)
ax1 = plt.subplot(gs[0,:-2])
ax2 = plt.subplot(gs[0,-2])
ax3 = plt.subplot(gs[0,-1])

# Take a fraction of the samples where target value (default) is 'no'
df_no = df[df.default2 == 0].sample(frac=0.15)
# Take all samples  where target value is 'yes'
df_yes = df[df.default2 == 1]
df_ = df_no.append(df_yes)

ax1.scatter(df_[df_.default == 'Yes'].balance, df_[df_.default == 'Yes'].income, s=40, c='orange', marker='+',
            linewidths=1)
ax1.scatter(df_[df_.default == 'No'].balance, df_[df_.default == 'No'].income, s=40, edgecolors='lightblue',facecolors = 'white', marker='o',
            linewidths=1,alpha = 0.6)

ax1.set_ylim(ymin=0)
ax1.set_ylabel('Income')
ax1.set_xlim(xmin=-100)
ax1.set_xlabel('Balance')

c_palette = {'No':'lightblue', 'Yes':'orange'}
sns.boxplot('default', 'balance', data=df, orient='v', ax=ax2, palette=c_palette)
sns.boxplot('default', 'income', data=df, orient='v', ax=ax3, palette=c_palette)
gs.tight_layout(plt.gcf())

Consider the Default data set where the response default falls into one of two categories, Yes or No. Rather than modeling this reponse Y directly, logistic regression models the probability that Y belongs to a particular category. How should we model the relationship between $p(X) = P(Y = 1|X)$ and $X$? (For convenience we are using the generic 0/1 coding for the response).

Logistic regression belongs to generalized linear models, i.e, it's a linear model with a link function that maps the output of linear multiple regression to the posterior probability of class 1 using the logistic sigmoid function: $$ p(X) = \frac{e^{\beta_0+\beta_1X_1 + \cdots +\beta_p X_p}}{1+e^{\beta_0+\beta_1X_1 + \cdots +\beta_p X_p}} $$ where $X = (X_1,...,X_p)$ are p predictors. After a bit of manipulation of the equation, we find that $$ \log(\frac{p(X)}{1-p(X)}) = \beta_0+\beta_1X_1 + \cdots +\beta_p X_p $$ The quantity $p(X)/1-p(X)$ is called the odds, and can take on any value between $0$ and $\infty$. The term $\log(\frac{p(X)}{1-p(X)})$ is called the log-odds or logit.

Estimate the Regression Coefficients¶

The coefficients $\beta_0$ and $\beta_1$ are unknown, and must be estimated based on the avaiable training data. In the linear regression, we use the least squares approach to estimate the unknown linear regression coefficients. For logistic regression, the loss function for sample $i$ is the negative log of the probability: $$ l(\beta_0,\beta_1) = \prod_{i:y_i = 1}p(x_i)\prod_{i':y_{i'} = 0}(1-p(x_{i'})) $$ The estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the loss function.

Logistic (Sigmoid) Function is presented below:

In [27]:
def logistic(x): return 1 / (1 + np.exp(-x))
x = np.linspace(-6, 6, 100)
plt.plot(x, logistic(x)) 
plt.grid(True) 
plt.title('Logistic (sigmoid)')
Out[27]:
Text(0.5, 1.0, 'Logistic (sigmoid)')
In [141]:
X = df[['balance', 'income']]
y = df.default2
clf = skl_lm.LogisticRegression(solver='newton-cg')
clf.fit(X,y)
Out[141]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

Assess the Accuracy of the Model¶

In the field of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an supervised learning algorithm. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa).

In [142]:
y_pred = clf.predict(X)
from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y, y_pred)
pd.DataFrame(confusion_matrix,columns = ["No","Yes"],index = ["No","Yes"])
Out[142]:
No Yes
No 9630 37
Yes 225 108

The confusion matrix tells us that we have $9630 + 108$ correct predictions and $225 + 37$ wrong predictions.

Exercise¶

The Auto dataset provides the information for 392 vehicles. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.The data set has 392 observations with 9 variables:

  • mpg: miles per gallon
  • cylinders: Number of cylinders between 4 and 8
  • displacement: Engine displacement
  • horsepower: Engine horsepower
  • weight: Vehicle weight(lbs)
  • acceleration: Time to accelerate from 0 to 60 mph (sec.)
  • year: Model year
  • origin: Origin of car (1. American, 2. European, 3. Japanese)
  • name: Vehicle name

(a) Load the data Auto.csv and remove the missing values from the dataset.¶

Instructions:

  1. Use the read_csv function in pandas to read the dataset.
  2. There are 16 missing values in the original dataset and they are marked with "?". The argument na_values can help to identy the missing values when the dataset is read.
  3. The dropna function will drop the observations with missing values automatically.
In [144]:
auto = pd.read_csv('https://raw.githubusercontent.com/jtao/AdvancedML/main/data/Auto.csv', na_values='?').dropna()
auto.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 30.6+ KB
In [145]:
auto.head()
Out[145]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 1 ford torino

(b) Perform a simple linear regression with mpg as the response (target) and horsepower as the predictor. Calculate the $R^2$. Comment on the output.¶

For example:

  1. Is there a relationship between the predictor and the response? Is the relationship between the predictor and the response postive or negative?
  2. How strong is the relationship between the predictor and the response?
In [147]:
Horsepower = auto.horsepower.to_numpy().reshape(-1, 1)
Mpg = auto.mpg.to_numpy().reshape(-1, 1)
plt.scatter(Horsepower, Mpg, facecolors='None', edgecolors='k', alpha=.5);
In [153]:
# Linear regression
## generate new dataset 
Horsepower_test = np.arange(40, 230, step = 0.1).reshape(-1, 1)

## Linear Regression Model: Rating ~ Year
from sklearn import linear_model
fit_lm = linear_model.LinearRegression()
fit1 = fit_lm.fit(Horsepower, Mpg)
y1 = fit1.predict(Horsepower_test)

## Print the coefficients (beta_0 and beta_1)
print("Coefficient of Horsepower :", fit1.coef_[0])
print("Model intercept:", fit1.intercept_)

## Calcuate the R2
Mpg_pred = fit_lm.predict(Horsepower)
print("R2:",r2_score(Mpg, Mpg_pred))
Coefficient of Horsepower : [-0.15784473]
Model intercept: [39.93586102]
R2: 0.6059482578894348
In [154]:
##Visualize the fitting result 
plt.scatter(Horsepower, Mpg, color = "white", edgecolor = "black", label="Raw Data")
plt.plot(Horsepower_test, y1, '-', color = "r", linewidth = 2, label="Linear Regression")
plt.xlabel("horsepower")
plt.ylabel("mpg")
plt.legend();
  1. Is there a relationship between the predictor and the response? Is the relationship between the predictor and the response postive or negative?

Yes, with the increase of the horsepower, the mpg decreases accordingly and this suggests that there is negative a relationship between the predictor and the response.

  1. How strong is the relationship between the predictor and the response?

For a unit increase in horsepower, the mpg fitted by model will decrease by 0.1578. The $R^2 = 0.61$ suggests that about $61\%$ variability in mpg is explained by the linear regression on horsepower.

(c) Fit the regression model with higher order polynomials, e.g., order = 2, 5. Comment on the output and consider how to improve the performance of the model with order = 5 using regularization (e.g., $l_1$ penalty).¶

In [155]:
## Polynomial Regression Model: Rating ~ Year + Year^2
from sklearn.preprocessing import PolynomialFeatures
Horsepower_poly_2 = PolynomialFeatures(degree = 2).fit_transform(Horsepower)
fit2 = fit_lm.fit(Horsepower_poly_2, Mpg)
y2 = fit2.predict(PolynomialFeatures(degree = 2).fit_transform(Horsepower_test))

## Polynomial Regression Model: Rating ~ Year + Year^2 + Year^3 + Year^4 + Year^5
Horsepower_poly_5 = PolynomialFeatures(degree = 5).fit_transform(Horsepower)
fit5 = fit_lm.fit(Horsepower_poly_5, Mpg)
y5 = fit5.predict(PolynomialFeatures(degree = 5).fit_transform(Horsepower_test))
In [162]:
## Visualization
plt.scatter(Horsepower, Mpg, color = "white", edgecolor = "black", label="Raw Data", alpha=0.4)
plt.plot(Horsepower_test, y2, '-', color = "red", linewidth = 2, label="Polynomial - Degree 2")
plt.plot(Horsepower_test, y5, '-', color = "green", linewidth = 2, label="Polynomial - Degree 5")
plt.xlabel("horsepower")
plt.ylabel("mpg")
plt.legend();
In [165]:
from sklearn.linear_model import Lasso
model_lasso = make_pipeline(PolynomialFeatures(degree = 5), Lasso(alpha=0.01))
model_lasso.fit(Horsepower_poly_5,Mpg)
yfit_lasso = model_lasso.predict(PolynomialFeatures(degree = 5).fit_transform(Horsepower_test))

plt.scatter(Horsepower, Mpg, color = "white", edgecolor = "black", label="Raw Data", alpha=0.4)
plt.plot(Horsepower_test, y2, '-', color = "red", linewidth = 2, label="Polynomial - Degree 2")
plt.plot(Horsepower_test, y5, '-', color = "green", linewidth = 2, label="Polynomial - Degree 5")
plt.plot(Horsepower_test, yfit_lasso, '-', color = "orange", linewidth = 2, label="Lasso")
plt.legend();

Gaussian Process Regression¶

Gausian processes are a supervised learning method designed to solve regression and probabilistic classification problems.

The advantages of Gaussian processes are:

  • The prediction interpolates the observations (at least for regular kernels).
  • The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
  • Versatile: different kernels can be specified. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of Gaussian processes include:

  • They are not sparse, i.e., they use the whole samples/features information to perform the prediction.
  • They lose efficiency in high dimensional spaces – namely when the number of features exceeds a few dozens.

Implementation of Gaussian processes for regression purposes gives the Gaussian process regression (GPR). The prior of the GP needs to be specified. The prior mean is assumed to be constant. The prior covariance is specified through a kernel object.

The Bayesian approach infers a probability distribution over all possible values. Assume the linear function as following, $$\mathbf{y} = \mathbf{w}\mathbf{x} + \boldsymbol{\epsilon}$$

The Bayesian approach specifies a prior distribution for $p(\mathbf{w})$ on the parameter $\mathbf{w}$ and relocating probabilities based on the observed data using Bayes' Rule: $$p(\mathbf{w}|\mathbf{y}, \mathbf{x}) = \frac{p(\mathbf{y}|\mathbf{x},\mathbf{w})p(\mathbf{w})}{p(\mathbf{y}|\mathbf{x})}$$

To get the predictions at unseen points of interest $mathbf{x}^*$, the predictive distribution can be calculated by weighting all possible predictions by the posterior distribution $$p(f^*|\mathbf{x}^*, \mathbf{y},\mathbf{x}) = \int_{\mathbf{w}}p(f^*|x^*, w)p(w|y, x)$$

In GPR, we first assume a Gaussian process prior through the mean and covariance function, like an infinite-dimensional multivariate Gaussian distribution. $$f(x)\sim GP(m(x), k(x, x'))$$ $$y\sim GP(m(x), k(x, x') + \delta_{ij}\sigma^2), \text{ where }y = f(x) + \epsilon, \epsilon \sim N(0, \sigma^2)$$ From the Gaussian process prior, the collection of training and test data points are joint multivariate Gaussian distributed, so the joint distribution can be written as $$\left[\begin{array}{c} y\\ f^* \end{array}\right]\sim \mathcal{N}\left(\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\right)$$ where $$\boldsymbol{\mu}^* = [\boldsymbol{\mu}, \mu^*], \boldsymbol{\Sigma^*} = \left[\begin{array}{cc} K(X, X)+\sigma^2 I & K(X, x^*)\\ K(x^*, X) & K(x^*, x^*)\end{array}\right]$$

Example¶

Random generate data from a uniform distribution.

In [21]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel as C
In [22]:
rng = np.random.RandomState(1)
X = np.linspace(0, 10, 10)
y = X * np.sin(X)
dy = 0.1 + 0.5 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise
# GaussFeats = GaussianFeatures(10).fit_transform(x.reshape(-1, 1))

plt.figure(figsize = (6, 6))
# plt.subplot(1,2,1)
plt.scatter(X, y)
plt.xlabel('x')
plt.ylabel('y')

# plt.subplot(1,2,2)
# for i in range(10):
#     plt.plot(x, GaussFeats[:, i], '--', color = 'C'+ str(i),  linewidth = 0.5)

# plt.xlabel('x')
# plt.ylabel('Gaussian Features')

plt.show()
In [23]:
x = np.linspace(0, 12, 100)

# Instantiate a Gaussian Process model
kernel = Matern(nu = 2.5)
gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 9)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X.reshape(-1, 1), y)

# Make the prediction on the testing dataset (ask for MSE as well)
y_pred, sigma = gp.predict(x.reshape(-1, 1), return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure(figsize=(10,6))
plt.plot(x, x * np.sin(x), 'r:', label=r'$f(x) = x\,\sin(x)$')
plt.scatter(x, y_pred, color = 'black', label='Prediction')
plt.errorbar(X, y, dy, label = 'Observation', color = 'red')
plt.fill(np.concatenate([x.reshape(-1, 1), x.reshape(-1, 1)[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-15, 10)
plt.legend(loc='upper left')

plt.show()

A real data example¶

The data relates to an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Following previous work we present results for just one of the seven mappings, from the 21 input variables to the first of the seven torques. The training data is of size $44484 \times 28$ and the test data is of size $4449 \times 28$. The first 21 columns are the input variables, and the 22nd column is used as the target variable.

In [27]:
training_data = pd.read_csv('https://raw.githubusercontent.com/XiaomengYan/MachineLearning_dataset/main/sarcos_inv.csv')
test_data = pd.read_csv('https://raw.githubusercontent.com/XiaomengYan/MachineLearning_dataset/main/sarcos_inv_test.csv')

X_train = training_data.iloc[:100, 1:22]
y_train = training_data.iloc[:100, 22]
X_test = test_data.iloc[:, 1:22]
y_test = test_data.iloc[:, 22]
In [28]:
# Instantiate a Gaussian Process model
kernel = Matern(nu = 2.5)
gp = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 9)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X_train, y_train)

# Make the prediction on the testing dataset (ask for MSE as well)
y_pred, sigma = gp.predict(X_test, return_std=True)

MSE_test = np.mean(np.power(y_pred - y_test, 2))
print("MSE for the test set by GPR: ", MSE_test)

# Linear & Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, LinearRegression

regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred_lr = regr.predict(X_test)

MSE_test = np.mean(np.power(y_pred_lr - y_test, 2))
print("MSE for the test set by linear model: ", MSE_test)

ridge_regr = Ridge(alpha = 2.0)
ridge_regr.fit(X_train,y_train)
y_pred_rg = ridge_regr.predict(X_test)

MSE_test = np.mean(np.power(y_pred_rg - y_test, 2))
print("MSE for the test set by Ridge model: ", MSE_test)
MSE for the test set by GPR:  362.69497537392914
MSE for the test set by linear model:  711519.7547611272
MSE for the test set by Ridge model:  246.41148282513618
In [24]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import svm

# loading data
from sklearn.datasets import load_digits

data = load_digits()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17              #test_size controls the proportion of test data in the whole data
)
n_samples = len(data.images)
print(n_samples)
1797
In [25]:
C_list = np.array([0.001, 0.1, 0.5, 1, 10, 100])
kernel_list = ['linear', 'poly', 'rbf', 'sigmoid']

accuracy_score_dict = dict()
for cc in C_list:
    for kern in kernel_list: 
        clf_svm = svm.SVC(C = cc, kernel = kern)
        clf_svm.fit(X_train, y_train)
        svm_pred = clf_svm.predict(X_test)
        accuracy_score_dict[kern + '_' + str(cc)] = accuracy_score(y_test,svm_pred)
In [26]:
clf_svm = svm.SVC(C = 0.1, kernel = 'poly')
clf_svm.fit(X_train, y_train)
svm_pred = clf_svm.predict(X_test)

_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, prediction in zip(axes, X_test, svm_pred):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f'Prediction: {prediction}')

Advanced Forecasting Model - Facebook Prophet¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Background¶

The Prophet forecasting model is based on the following formula $$y(t) = g(t) + s(t) + h(t) + \epsilon_t,$$ where $y(t)$ are the observations at time $t, t=1,\dots, T$ (training data). $g(t)$, $s(t)$, $h(t)$ are the trend, seasonality, holiday components respectively.

The trends are modeled by a logistic (S-shape trend) function or a piecewise linear function. The seasonalities are modeled by the Fourier series. The holidays are modeled by the indicators. The illustration of the detailed model can be found in https://peerj.com/preprints/3190v2/

The forecasting procedure is: first obtain the estimating of these three components $\widehat{g}(t)$, $\widehat{s}(t)$, $\widehat{h}(t)$. At the new time point $T+1$, forecast the output value $\widehat{y}(T+1) = \widehat{g}(T+1) + \widehat{s}(T+1) + \widehat{h}(T+1)$.

Pros of Prophet:

  1. Flexibility: We can easily accommodate seasonality with multiple periods and let the analyst make different assumptions about trends.
  2. Not neccessary time regularly spaced
  3. Fast computation
  4. Easily interpretable components
In [166]:
# install facebook prophet if it is not installed yet
#!pip install prophet

Read Data¶

To utilize Facebook Prophet forecasting model, the column names must be ds (the data timestamps) and y (the numeric values) respectively.

In [167]:
# Sample Energy consumption data of in UK.
# https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
    
df1 = pd.read_csv('https://raw.githubusercontent.com/jtao/AdvancedML/main/data/Energy.csv', parse_dates=["day"])
In [168]:
df1.columns = ["ds","y"]  # change the column names
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 673 entries, 0 to 672
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   ds      673 non-null    datetime64[ns]
 1   y       673 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 10.6 KB

Forecasting¶

We then use the Prophet function to forecast the future values. To verify the forecasting model, we separate the electricity daily data into two parts. First part is the values from day 1 to day 650 as the training data, while the second part is the rest of days.

The arguments daily_seasonality, yearly_seasonality, and weekly_seasonality can be manually controlled based on training data. interval_width from 0 to 1 controls the size of forecasting confidence intervals. The higher interval_width is, the larger confidence interval is.

We also consider the holiday effects during forecasting, since the holidays do affect the electricity consumptions.

In [173]:
# import Prophet
from prophet import Prophet

# model fitting and forecasting
m = Prophet(daily_seasonality = False, yearly_seasonality = True, weekly_seasonality = True,interval_width=0.8)
m.add_country_holidays(country_name='US')
m.fit(df1[1:650])
future = m.make_future_dataframe(periods=31) # future 31 time points

forecast = m.predict(future) # forecast
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[173]:
ds yhat yhat_lower yhat_upper
675 2014-03-04 0.054982 0.041383 0.067977
676 2014-03-05 0.054994 0.041432 0.068570
677 2014-03-06 0.054566 0.040596 0.066922
678 2014-03-07 0.053382 0.039734 0.066639
679 2014-03-08 0.053657 0.040296 0.067371
In [174]:
m.train_holiday_names
Out[174]:
0                 New Year's Day
1      New Year's Day (Observed)
2     Martin Luther King Jr. Day
3          Washington's Birthday
4                   Memorial Day
5               Independence Day
6                      Labor Day
7                   Columbus Day
8                   Veterans Day
9        Veterans Day (Observed)
10                  Thanksgiving
11                 Christmas Day
dtype: object
In [175]:
# plot the model fitting and forecasting with pointwise confidence intervals.
fig1 = m.plot(forecast)
In [44]:
# different components 
fig2 = m.plot_components(forecast)
In [45]:
# interactive plots
from prophet.plot import plot_plotly, plot_components_plotly

plot_plotly(m, forecast)

Automatic anomaly detection¶

We develop the anomalydetection function to detect the anomalies via Facebook Prophet.

The inputs are data_train, data_test, and interval_width, while the outputs are the labels of normal data or anomalies.

Also, we can manually control the different seasonalities in the inner function.

In [184]:
# The automatic anomaly detection function
# inputs
# data_train: the training data consist two columns: first column 'ds' represents the timestamps; 
#               second column 'y' represents the numeric values.
# data_test: the testing data with two columns similar to data_train.
# interval_width: the width of prediction interval
# outputs
# labels: the labels indicating the testing data are anomalies or not.

def anomalydetection(data_train,data_test,interval_width):
    from prophet import Prophet
    m = Prophet(daily_seasonality = False, yearly_seasonality = True, weekly_seasonality = True,interval_width=interval_width)
    m.add_country_holidays(country_name='US')
    m.fit(data_train)
    future = m.make_future_dataframe(periods=data_test.shape[0])
    forecast = m.predict(future)

    intervals = forecast[(data_train.shape[0]):(data_train.shape[0]+data_test.shape[0])][['ds','yhat_lower','yhat_upper']]

    labels = []
    for i in range(data_test.shape[0]):
        if data_test.iloc[i-1].at['y'] > intervals.iloc[i-1].at['yhat_lower'] and data_test.iloc[i-1].at['y'] < intervals.iloc[i-1].at['yhat_upper']:
            labels.append('Normal')
        else:
            labels.append('Anomaly')
    return labels
In [186]:
anomalydetection(df1[0:650],df1[650:673],interval_width=0.95)
Out[186]:
['Anomaly',
 'Normal',
 'Anomaly',
 'Normal',
 'Normal',
 'Normal',
 'Anomaly',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Anomaly',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal',
 'Normal']
In [ ]: